Abstract

Canonical correlation analysis (CCA) is a 
method for reducing the dimension of data 
represented using two views. It has been 
previously used to derive word embeddings, 
where one view indicates a word, and the 
other view indicates its context. We describe a 
way to incorporate prior knowledge into CCA, 
give a theoretical justification for it, and test it 
by deriving word embeddings and evaluating 
them on a myriad of datasets. 

1 Introduction 

In recent years there has been an immense interest in representing
words as low-dimensional continuous real vectors, namely word
embeddings. Word embeddings aim to capture lexico-semantic information
such that regularities in the vocabulary are topologically represented
in a Euclidean space. Such word embeddings have achieved
state-of-the-art performance on many natural language processing (NLP)
tasks, e.g., syntactic parsing (Socher et al., 2013), word or phrase
similarity (Mikolov et al., 2013b), dependency parsing (Bansal et al.,
2014), unsupervised learning (Parikh et al., 2014) and others. Since
the discovery that word embeddings are useful as features for various
NLP tasks, research on word embeddings has taken on a life of its own,
with a vibrant community searching for better word representations in
a variety of problems and datasets.

These word embeddings are often induced from large raw text capturing
distributional co-occurrence information via neural networks (Bengio
et al., 2003; Mikolov et al., 2013b; Mikolov et al., 2013c) or
spectral methods (Deerwester et al., 1990; Dhillon et al., 2015).
While these general-purpose word embeddings have achieved significant
improvement in various tasks in NLP, it has been discovered that
further tuning of these continuous word representations for specific
tasks improves their performance by a larger margin. For example, in
dependency parsing, word embeddings could be tailored to capture
similarity in terms of context within syntactic parses (Bansal et al.,
2014) or they could be refined using semantic lexicons such as WordNet
(Miller, 1995), FrameNet (Baker et al., 1998) and the Paraphrase
Database (Ganitkevitch et al., 2013) to improve various similarity
tasks (Yu and Dredze, 2014; Faruqui et al., 2015; Rothe and Schütze,
2015). This paper proposes a method to encode prior semantic knowledge
in spectral word embeddings (Dhillon et al., 2015).

Spectral learning algorithms are of great interest for their speed,
scalability, theoretical guarantees and performance in various NLP
applications. These algorithms are no strangers to word embeddings
either. In latent semantic analysis (LSA; Deerwester et al., 1990;
Landauer et al., 1998), word embeddings are learned by performing SVD
on the word-by-document matrix. Recently, Dhillon et al. (2015) have
proposed to use canonical correlation analysis (CCA) as a method to
learn low-dimensional real vectors, called Eigenwords. Unlike
LSA-based methods, CCA-based methods are scale invariant and can
capture multiview information such as the left and right contexts of
the words. As a result, the eigenword embeddings of Dhillon et al.
(2015) that were learned using the simple linear methods give
accuracies comparable to or better than state of the art when compared
with highly non-linear deep learning based approaches (Collobert and
Weston, 2008; Mnih and Hinton, 2007; Mikolov et al., 2013b; Mikolov et
al., 2013c).

The main contribution of this paper is a technique to incorporate
prior knowledge into the derivation of canonical correlation analysis.
In contrast to previous work where prior knowledge is introduced in
the off-the-shelf embeddings as a post-processing step (Faruqui et
al., 2015; Rothe and Schütze, 2015), our approach introduces prior
knowledge in the CCA derivation itself. In this way it preserves the
theoretical properties of spectral learning algorithms for learning
word embeddings. The prior knowledge is based on lexical resources
such as WordNet, FrameNet and the Paraphrase Database.

Our derivation of CCA to incorporate prior knowledge is not limited to
eigenwords and can be used with CCA for other problems. It follows a
similar idea to the one proposed by Koren and Carmel (2003) for
improving the visualization of principal vectors with principal
component analysis (PCA). Our derivation represents the solution to
CCA as that of an optimization problem which maximizes the distance
between the two view projections of training examples, while weighting
these distances using the external source of prior knowledge. As such,
our approach applies to other uses of CCA in the NLP literature, such
as the one of Jagarlamudi and Daumé (2012), who used CCA for
transliteration, or the one of Silberer et al. (2013), who used CCA
for semantically representing visual attributes.

2 Background and Notation 

For an integer $n$, we denote by $[n]$ the set of integers
$\{1, \ldots, n\}$. We assume the existence of a vocabulary of words,
usually taken from a corpus. This set of words is denoted by
$H = \{h_1, \ldots, h_{|H|}\}$. For a square matrix $A$, we denote by
$\mathrm{diag}(A)$ a diagonal matrix $B$ which has the same dimensions
as $A$ such that $B_{ii} = A_{ii}$ for all $i$. For a vector
$v \in \mathbb{R}^d$, we denote its $\ell_2$ norm by $\|v\|$, i.e.
$\|v\| = \sqrt{\sum_{i=1}^{d} v_i^2}$. We also denote by $v_j$ or
$[v]_j$ the $j$th coordinate of $v$. For a pair of vectors $u$ and
$v$, we denote their dot product by $\langle u, v \rangle$.

We define a word embedding as a function $f$ from $H$ to
$\mathbb{R}^m$ for some (relatively small) $m$. For example, in our
experiments we vary $m$ between 50 and 300. The word embedding
function maps the word to some real-vector representation, with the
intention to capture regularities in the vocabulary that are
topologically represented in the corresponding Euclidean space. For
example, all vocabulary words that correspond to city names could be
grouped together in that space.

Research on the derivation of word embeddings that capture various
regularities has greatly accelerated in recent years. Various methods
used for this purpose range from low-rank approximations of
co-occurrence statistics (Deerwester et al., 1990; Dhillon et al.,
2015) to neural networks jointly learning a language model (Bengio et
al., 2003; Mikolov et al., 2013a) or models for other NLP tasks
(Collobert and Weston, 2008).


3 Canonical Correlation Analysis for 
Deriving Word Embeddings 

One recent approach to derive word embeddings, developed by Dhillon et
al. (2015), is through the use of canonical correlation analysis,
resulting in so-called “eigenwords.” CCA is a technique for multi-view
dimensionality reduction. It assumes the existence of two views for a
set of data, similarly to co-training (Yarowsky, 1995; Blum and
Mitchell, 1998), and then projects the data in the two views in a way
that maximizes the correlation between the projected views.

Dhillon et al. (2015) used CCA to derive word embeddings through the
following procedure. They first break each document in a corpus of
documents into $n$ sequences of words of a fixed length $2k + 1$,
where $k$ is a window size. For example, if $k = 2$, the short
document “Harry Potter has been a best-seller” would be broken into
“Harry Potter has been a” and “Potter has been a best-seller.” In each
such sequence, the middle word is identified as a pivot.

This leads to the construction of the following training set from a
set of documents:

$$\{(w_1^{(i)}, \ldots, w_k^{(i)}, w^{(i)}, w_{k+1}^{(i)}, \ldots, w_{2k}^{(i)}) \mid i \in [n]\}.$$

With abuse of notation, this is a multiset, as certain words are
expected to appear in certain contexts multiple times. Each $w^{(i)}$
is a pivot word, and the rest of the elements are words in the
sequence called “the context words.” With this training set in mind,
the two views for CCA are defined as follows.
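The windowing step described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions (whitespace tokenization, a hypothetical helper name `pivot_windows`); it is not the implementation of Dhillon et al. (2015).

```python
def pivot_windows(tokens, k):
    """Return (pivot, context) pairs for every window of length 2k + 1.

    The pivot is the middle word; the context is the k words on each side.
    """
    examples = []
    for start in range(len(tokens) - 2 * k):
        window = tokens[start:start + 2 * k + 1]
        pivot = window[k]
        context = window[:k] + window[k + 1:]
        examples.append((pivot, context))
    return examples


# The short document from the text, with k = 2 (windows of length 5).
doc = "Harry Potter has been a best-seller".split()
examples = pivot_windows(doc, k=2)
```

For the six-token example this yields two training examples, with pivots "has" and "been", matching the two sequences given in the text.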

We define the first view through a sparse “context matrix”
$C \in \mathbb{R}^{n \times 2k|H|}$ such that each row in the matrix
is a vector, consisting of $2k$ one-hot vectors, each of length $|H|$.
Each such one-hot vector corresponds to a word that fired in a
specific index in the context. In addition, we also define a second
view through a matrix $W \in \mathbb{R}^{n \times |H|}$ such that
$W_{ij} = 1$ if $w^{(i)} = h_j$. We present both views of the training
set in Figure 1.

Figure 1: The word and context views represented as matrices $W$ and
$C$. Each row in $W$ is a vector of length $|H|$, corresponding to a
one-hot vector for the word in the example indexed by the row. Each
row in $C$ is a vector of length $2k|H|$, divided into sub-vectors
each of length $|H|$. Each such sub-vector is a one-hot vector for one
of the $2k$ context words in the example indexed by the row.

Note that now the matrix $M = W^\top C$ is in
$\mathbb{R}^{|H| \times (2k|H|)}$ such that each element $M_{ij}$
gives the count of times that $h_i$ appeared with the corresponding
context word and context index encoded by $j$. Similarly, we define
the matrices $D_1 = \mathrm{diag}(W^\top W)$ and
$D_2 = \mathrm{diag}(C^\top C)$. Finally, to get the word embeddings,
we perform singular value decomposition (SVD) on the matrix
$D_1^{-1/2} M D_2^{-1/2}$. Note that in its original form, CCA
requires use of $W^\top W$ and $C^\top C$ in their full form, and not
just the corresponding diagonal matrices $D_1$ and $D_2$; however, in
practice, inverting these matrices can be quite intensive
computationally and can lead to memory issues. As such, we approximate
CCA by using the diagonal matrices $D_1$ and $D_2$.

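From the SVD step, we get two projections
$U \in \mathbb{R}^{|H| \times m}$ and
$V \in \mathbb{R}^{2k|H| \times m}$ such that

$$D_1^{-1/2} M D_2^{-1/2} \approx U \Sigma V^\top,$$

where $\Sigma \in \mathbb{R}^{m \times m}$ is a diagonal matrix with
$\Sigma_{ii} > 0$ being the $i$th largest singular value of
$D_1^{-1/2} M D_2^{-1/2}$. In order to get the final word embeddings,
we calculate $D_1^{-1/2} U \in \mathbb{R}^{|H| \times m}$. Each row in
this matrix corresponds to an $m$-dimensional vector for the
corresponding word in the vocabulary. This means that $f(h_i)$ for
$h_i \in H$ is the $i$th row of the matrix $D_1^{-1/2} U$. The
projection $D_2^{-1/2} V$ can be used to get “context embeddings.” See
more about this in Dhillon et al. (2015).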
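The procedure of this section can be sketched with NumPy on a toy example. The sketch below is an assumption-laden illustration, not the SWELL implementation: it uses dense matrices and a full SVD (a realistic vocabulary would need sparse matrices and a truncated SVD), and the helper names `inverse_sqrt` and `eigenword_embeddings` are hypothetical.

```python
import numpy as np


def inverse_sqrt(counts):
    """Element-wise counts ** -1/2, leaving zero counts at zero."""
    counts = np.asarray(counts, dtype=float)
    out = np.zeros_like(counts)
    nonzero = counts > 0
    out[nonzero] = counts[nonzero] ** -0.5
    return out


def eigenword_embeddings(pivots, contexts, vocab_size, k, m):
    """pivots: word ids; contexts: lists of 2k word ids, one list per pivot."""
    n = len(pivots)
    W = np.zeros((n, vocab_size))            # word view, one-hot rows
    C = np.zeros((n, 2 * k * vocab_size))    # context view, 2k one-hot blocks
    for i, (w, ctx) in enumerate(zip(pivots, contexts)):
        W[i, w] = 1.0
        for pos, c in enumerate(ctx):
            C[i, pos * vocab_size + c] = 1.0
    M = W.T @ C                              # word/context co-occurrence counts
    d1 = inverse_sqrt(np.diag(W.T @ W))      # diagonal of D1^{-1/2}
    d2 = inverse_sqrt(np.diag(C.T @ C))      # diagonal of D2^{-1/2}
    scaled = d1[:, None] * M * d2[None, :]   # D1^{-1/2} M D2^{-1/2}
    U, s, _ = np.linalg.svd(scaled, full_matrices=False)
    return d1[:, None] * U[:, :m], s[:m]     # rows are f(h_i) = [D1^{-1/2} U]_i


# Toy data: 3 words, k = 1 (one context word on each side), m = 2.
pivots = [0, 1, 2, 0]
contexts = [[1, 2], [0, 2], [1, 0], [1, 2]]
embeddings, singular_values = eigenword_embeddings(pivots, contexts, 3, 1, 2)
```

Each row of `embeddings` is the vector for one vocabulary word. The square-root transform of the counts mentioned in the experimental section is deliberately left out here.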

This use of CCA to derive word embeddings follows the usual
distributional hypothesis (Harris, 1957) that most word embedding
techniques rely on. In the case of CCA, this hypothesis is translated
into action in the following way. CCA finds projections for the
contexts and for the pivot words which are most correlated. This means
that if a word co-occurs in a specific context many times (either
directly, or transitively through similarity to other words), then
this context is expected to be projected to a point “close” to the
point to which the word is projected. As such, if two words occur in a
specific context many times, these two words are expected to be
projected to points which are close to each other.

For the next section, we denote $X = W D_1^{-1/2}$ and
$Y = C D_2^{-1/2}$. To refer to the dimensions of $X$ and $Y$
generically, we denote $d = |H|$ and $d' = 2k|H|$. In addition, we
refer to the column vectors of $U$ and $V$ as $u_1, \ldots, u_m$ and
$v_1, \ldots, v_m$.

Mathematical Intuition Behind CCA The procedure that CCA follows finds
a projection of the two views in a shared space, such that the
correlation between the two views is maximized at each coordinate, and
there is minimal redundancy between the coordinates of each view. This
means that CCA solves the following sequence of optimization problems
for $j \in [m]$, where $a_j \in \mathbb{R}^{1 \times d}$ and
$b_j \in \mathbb{R}^{1 \times d'}$:

$$
\begin{aligned}
\arg\max_{a_j, b_j} \quad & \mathrm{corr}(a_j W^\top, b_j C^\top) \\
\text{such that} \quad & \mathrm{corr}(a_j W^\top, a_k W^\top) = 0, \quad k < j \\
& \mathrm{corr}(b_j C^\top, b_k C^\top) = 0, \quad k < j
\end{aligned}
$$

where corr is a function that accepts two vectors and returns the
Pearson correlation between the pairwise elements of the two vectors.
The approximate solution to this optimization problem (when using
diagonal $D_1$ and $D_2$) is $a_j = (D_1^{-1/2} u_j)^\top$ and
$b_j = (D_2^{-1/2} v_j)^\top$ for $j \in [m]$.

CCA also has a probabilistic interpretation as a 
maximum likelihood solution of a latent variable 
model for two normal random vectors, each drawn 
based on a third latent Gaussian vector (Bach and 
Jordan, 2005). 

The way we describe CCA for deriving word embeddings is related to
Latent Semantic Indexing (LSI), which performs singular value
decomposition on the matrix $M$ directly, without doing any kind of
variance normalization. Dhillon et al. (2015) describe some
differences between LSI and CCA. The extra normalization step
decreases the importance of frequent words when doing SVD.

4 Incorporating Prior Knowledge into 
Canonical Correlation Analysis 

In this section, we detail the technique we use to incorporate prior
knowledge into the derivation of canonical correlation analysis. The
main motivation behind our approach is to improve the optimization of
correlation between the two views by weighing them using the external
source of prior knowledge. The prior knowledge is based on lexical
resources such as WordNet, FrameNet and the Paraphrase Database. Our
approach follows a similar idea to the one proposed by Koren and
Carmel (2003) for improving the visualization of principal vectors
with principal component analysis (PCA). It is also related to
Laplacian manifold regularization (Belkin et al., 2006).

An important notion in our derivation is that of a Laplacian matrix.
The Laplacian of an undirected weighted graph is an $n \times n$
matrix where $n$ is the number of nodes in the graph. It equals
$D - A$ where $A$ is the adjacency matrix of the graph (so that
$A_{ij}$ is the weight for the edge $(i, j)$ in the graph, if it
exists, and 0 otherwise) and $D$ is a diagonal matrix such that
$D_{ii} = \sum_j A_{ij}$. The Laplacian is always a symmetric square
matrix such that the sum over rows (or columns) is 0. It is also
positive semi-definite.
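As a small illustration of this definition, the following NumPy sketch builds the Laplacian of a toy three-node weighted graph. The helper name `graph_laplacian` and the example weights are assumptions made for the illustration.

```python
import numpy as np


def graph_laplacian(adjacency):
    """Return D - A for a symmetric, non-negative adjacency matrix A."""
    A = np.asarray(adjacency, dtype=float)
    D = np.diag(A.sum(axis=1))   # D_ii is the sum of the weights in row i
    return D - A


# Toy undirected graph: edge (0, 1) has weight 2, edge (1, 2) has weight 1.
A = np.array([[0.0, 2.0, 0.0],
              [2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
L = graph_laplacian(A)
```

The three properties named in the text (symmetry, zero row and column sums, positive semi-definiteness) can be checked numerically on `L`, for example with `np.linalg.eigvalsh`.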

We propose a generalization of CCA, in which we introduce a Laplacian
matrix into the derivation of CCA itself, as shown in Figure 2. We
encode prior knowledge about the distances between the projections of
two views into the Laplacian. The Laplacian allows us to improve the
optimization of the correlation between the two views by weighing them
using the external source of prior knowledge.

4.1 Generalization of CCA 

We present three lemmas (proofs are given in Appendix A), followed by
our main proposition. These three lemmas are useful to prove our final
proposition.

The main proposition shows that CCA maximizes the distance between the
two view projections for any pair of examples $i$ and $j$, $i \neq j$,
while minimizing the two view projection distance for the two views of
an example $i$. The two views we discuss here in practice are the view
of the word through a one-hot representation, and the view which
represents the context words for a specific word token. The distance
between two view projections is defined in Eq. 2.

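Lemma 1. Let $X$ and $Y$ be two matrices of size $n \times d$ and
$n \times d'$, respectively, for example, as defined in §3. Assume
that $\sum_{i=1}^{n} X_{ij} = 0$ for $j \in [d]$ and
$\sum_{i=1}^{n} Y_{ij} = 0$ for $j \in [d']$. Let $L$ be an
$n \times n$ Laplacian matrix such that

$$L_{ij} = \begin{cases} n - 1 & \text{if } i = j \\ -1 & \text{if } i \neq j \end{cases} \qquad (1)$$

Then $X^\top L Y$ equals $X^\top Y$ up to a multiplication by a
positive constant.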
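Lemma 1 can be checked numerically. The Laplacian of Eq. 1 can be written as $nI - \mathbf{1}\mathbf{1}^\top$, and for centered $X$ the term $X^\top \mathbf{1}$ vanishes, so the positive constant is $n$. The sketch below uses random centered matrices; the sizes and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_prime = 6, 3, 4

# Random data, centered so that every column sums to zero.
X = rng.normal(size=(n, d))
X -= X.mean(axis=0)
Y = rng.normal(size=(n, d_prime))
Y -= Y.mean(axis=0)

# Laplacian of Eq. 1: n - 1 on the diagonal, -1 everywhere else.
L = n * np.eye(n) - np.ones((n, n))

lhs = X.T @ L @ Y
rhs = n * (X.T @ Y)   # same matrix, scaled by the positive constant n
```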

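Lemma 2. Let $A \in \mathbb{R}^{d \times d'}$. Then the rank $m$
thin-SVD of $A$ can be found by solving the following optimization
problem:

$$
\begin{aligned}
\max_{u_1, \ldots, u_m, v_1, \ldots, v_m} \quad & \sum_{i=1}^{m} u_i^\top A v_i \\
\text{such that} \quad & \|u_i\| = \|v_i\| = 1, \quad i \in [m] \\
& \langle u_i, u_j \rangle = \langle v_i, v_j \rangle = 0, \quad i \neq j
\end{aligned}
$$

where $u_i \in \mathbb{R}^{d \times 1}$ denote the left singular
vectors, and $v_i \in \mathbb{R}^{d' \times 1}$ denote the right
singular vectors.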
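A quick numerical illustration of Lemma 2, under arbitrary sizes and seed: the top $m$ singular vector pairs attain an objective value equal to the sum of the $m$ largest singular values, and a randomly drawn feasible point (orthonormal columns obtained from a QR decomposition) does not exceed it.

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_prime, m = 5, 7, 3
A = rng.normal(size=(d, d_prime))

# Objective value at the top-m singular vector pairs.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
svd_value = sum(U[:, i] @ A @ Vt[i] for i in range(m))

# Objective value at some other feasible point: random orthonormal columns.
Qu, _ = np.linalg.qr(rng.normal(size=(d, m)))
Qv, _ = np.linalg.qr(rng.normal(size=(d_prime, m)))
random_value = np.trace(Qu.T @ A @ Qv)
```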




Figure 2: Introducing prior knowledge in CCA.
$W \in \mathbb{R}^{n \times |H|}$ and
$C \in \mathbb{R}^{n \times 2k|H|}$ denote the word and context views
respectively. $L \in \mathbb{R}^{n \times n}$ is a Laplacian matrix
encoded with the prior knowledge about the distances between the
projections of $W$ and $C$.


The last utility lemma we describe shows that interjecting the
Laplacian between the two views can be expressed as a weighted sum of
the distances between the projections of the two views (these
distances are given in Eq. 2), where the weights come from the
Laplacian.

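Lemma 3. Let $u_1, \ldots, u_m$ and $v_1, \ldots, v_m$ be two sets of
vectors of length $d$ and $d'$ respectively. Let
$L \in \mathbb{R}^{n \times n}$ be a Laplacian and
$X \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^{n \times d'}$.
Then:

$$\sum_{k=1}^{m} (X u_k)^\top L (Y v_k) = \sum_{i,j} -L_{ij} \left(d^m_{ij}\right)^2,$$

where

$$d^m_{ij} = \sqrt{\frac{1}{2} \sum_{k=1}^{m} \left([X u_k]_i - [Y v_k]_j\right)^2}. \qquad (2)$$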
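The identity of Lemma 3 relies only on the rows and columns of $L$ summing to zero, which makes the squared terms in the expansion of Eq. 2 vanish and leaves the cross term. The following NumPy sketch checks it on random inputs; the sizes, the seed and the random graph are assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, d_prime, m = 5, 3, 4, 2

X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d_prime))
U = rng.normal(size=(d, m))        # columns play the role of u_1, ..., u_m
V = rng.normal(size=(d_prime, m))  # columns play the role of v_1, ..., v_m

# Laplacian of a random undirected weighted graph without self-loops.
A = rng.random((n, n))
A = (A + A.T) / 2
np.fill_diagonal(A, 0.0)
L = np.diag(A.sum(axis=1)) - A

P = X @ U                          # P[i, k] = [X u_k]_i
Q = Y @ V                          # Q[j, k] = [Y v_k]_j

lhs = np.trace(P.T @ L @ Q)        # sum_k (X u_k)^T L (Y v_k)

diff = P[:, None, :] - Q[None, :, :]       # diff[i, j, k] = P[i, k] - Q[j, k]
d_squared = 0.5 * (diff ** 2).sum(axis=2)  # (d^m_ij)^2 from Eq. 2
rhs = (-L * d_squared).sum()               # sum_{i,j} -L_ij (d^m_ij)^2
```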

The following proposition is our main result for 
this section. 

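Proposition 4. The matrices $U \in \mathbb{R}^{d \times m}$ and
$V \in \mathbb{R}^{d' \times m}$ that CCA computes are the
$m$-dimensional projections that maximize

$$\sum_{i,j} \left(d^m_{ij}\right)^2 - n \sum_{i=1}^{n} \left(d^m_{ii}\right)^2, \qquad (3)$$

where $d^m_{ij}$ is defined as in Eq. 2 for $u_1, \ldots, u_m$ being
the columns of $U$ and $v_1, \ldots, v_m$ being the columns of $V$.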

Proof. According to Lemma 3, the objective in Eq. 3 equals
$\sum_{k=1}^{m} (X u_k)^\top L (Y v_k)$ where $L$ is defined as in
Eq. 1. Therefore, maximizing Eq. 3 corresponds to maximization of
$\sum_{k=1}^{m} (X u_k)^\top L (Y v_k)$ under the constraints that the
$U$ and $V$ matrices have orthonormal vectors. Using Lemma 2, it can
be shown that the solution to this maximization is obtained by singular
value decomposition on $X^\top L Y$. According to Lemma 1, this
corresponds to finding $U$ and $V$ by singular value decomposition on
$X^\top Y$, because a multiplicative constant does not change the
value of the right/left singular vectors. □
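As a worked step (a sketch that follows from the definitions above, not text from the original proof), substituting the Laplacian of Eq. 1 into the right-hand side of Lemma 3 shows where the two terms of the objective in Eq. 3 come from. It uses $-L_{ij} = 1$ for $i \neq j$ and $-L_{ii} = -(n - 1)$:

```latex
\begin{aligned}
\sum_{i,j} -L_{ij}\,\left(d^m_{ij}\right)^2
  &= \sum_{i \neq j} \left(d^m_{ij}\right)^2
     \;-\; (n-1) \sum_{i=1}^{n} \left(d^m_{ii}\right)^2 \\
  &= \sum_{i,j} \left(d^m_{ij}\right)^2
     \;-\; n \sum_{i=1}^{n} \left(d^m_{ii}\right)^2 .
\end{aligned}
```

The second line moves the $n$ diagonal terms $\left(d^m_{ii}\right)^2$ into the first sum, which changes the coefficient of the second sum from $n - 1$ to $n$.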

The above proposition shows that CCA tries to find projections of both
views such that the distances between the two views for pairs of
examples with indices $i \neq j$ are maximized (first term in Eq. 3),
while minimizing the distance between the projections of the two views
for a specific example (second term in Eq. 3). Therefore, CCA tries to
project a context and a word in that context to points that are close
to each other in a shared space, while maximizing the distance between
a context and a word which do not often co-occur together.

As long as $L$ is a Laplacian, Proposition 4 is still true, only with
the maximization of the objective

$$\sum_{i,j} -L_{ij} \left(d^m_{ij}\right)^2, \qquad (4)$$

where $L_{ij} \le 0$ for $i \neq j$ and $L_{ii} \ge 0$. This result
lends itself to a generalization of CCA, in which we use predefined
weights for the Laplacian that encode some prior knowledge about the
distances that the projections of two views should satisfy.

If the weight $-L_{ij}$ is large for a specific $(i, j)$, then we will
try harder to maximize the distance between one view of example $i$
and the other view of example $j$ (i.e. we will try to project the
word $w^{(i)}$ and the context of example $j$ into distant points in
the space).

This means that in the current formulation, $-L_{ij}$ plays the role
of a dissimilarity indicator between pairs of words. The more
dissimilar words are, the larger the weight, and then the more distant
the projections are for the contexts and the words.

4.2 From CCA with Dissimilarities to CCA 
with Similarities 

It is often more convenient to work with similarity 
measures between pairs of words. To do that, we 
can retain the same formulation as before with the 
Laplacian, where —Lij now denotes a measure of 
similarity. Now, instead of maximizing the objective 
in Eq. 4, we are required to minimize it. 

It can be shown that such mirror formulation can 
be done with an algorithm similar to CCA, leading 
to a proposition in the style of Proposition 4. To 
solve this minimization formulation, we just need 
to choose the singular vectors associated with the 
smallest m singular values (instead of the largest). 

Once we change the CCA algorithm with the Laplacian to choose these
projections, we can define $L$, for example, based on a similarity
graph. The graph is an undirected graph that has $|H|$ nodes, one for
each word in the vocabulary, and there is an edge between a pair of
words whenever the two words are similar to each other based on some
external source of information, such as WordNet (for example, if they
are synonyms).

Inputs: Set of examples
$\{(w_1^{(i)}, \ldots, w_k^{(i)}, w^{(i)}, w_{k+1}^{(i)}, \ldots, w_{2k}^{(i)}) \mid i \in [n]\}$,
an integer $m$, an $\alpha \in (0, 1]$, an undirected graph $G$ over
$H$, an integer $N$.

Data structures: A matrix $M$ of size $|H| \times (2k|H|)$
(cross-covariance matrix), a matrix $U$ corresponding to the word
embeddings.

Algorithm:

(Cross-covariance estimation) $\forall i, j \in [n]$ such that
$|i - j| \le N$:

• If $i = j$, increase $M_{rs}$ by 1 for $r$ denoting the index of
  word $w^{(i)}$ and for all $s$ denoting the context indices of words
  $w_1^{(i)}, \ldots, w_k^{(i)}$ and
  $w_{k+1}^{(i)}, \ldots, w_{2k}^{(i)}$.

• If $i \neq j$ and word $w^{(i)}$ is connected to word $w^{(j)}$ in
  $G$, increase $M_{rs}$ by $\alpha$ for $r$ denoting the index of
  word $w^{(i)}$ and for all $s$ denoting the context indices of words
  $w_1^{(j)}, \ldots, w_k^{(j)}$ and
  $w_{k+1}^{(j)}, \ldots, w_{2k}^{(j)}$.

• Calculate $D_1$ and $D_2$ as specified in §3.

(Singular value decomposition step)

• Perform singular value decomposition on $D_1^{-1/2} M D_2^{-1/2}$ to
  get a matrix $U \in \mathbb{R}^{|H| \times m}$.

(Word embedding projection)

• For each word $h_i$ for $i \in [|H|]$ return the word embedding that
  corresponds with the $i$th row of $U$.

Figure 3: The CCA-like algorithm that returns word embeddings with
prior knowledge encoded based on a similarity graph.

We then define the Laplacian $L$ such that $L_{ij} = -1$ if $i$ and
$j$ are adjacent in the graph (and $i \neq j$), $L_{ii}$ is the degree
of the node $i$ and $L_{ij} = 0$ in all other cases. By using this
variant of CCA, we strive to maximize the distance of the two views
between words which are not adjacent in the graph (or continuing the
example above, maximize the distance between words which are not
synonyms). In addition, the fewer adjacent nodes a word has (or the
fewer synonyms it has), the less important it is to minimize the
distance between the two views of that given word.




4.3 Final Algorithm 

In order to use an arbitrary Laplacian matrix with CCA, we require
that the data is centered, i.e. that the average over all examples of
each of the coordinates of the word and context vectors is 0. However,
such a prerequisite would make the matrices $C$ and $W$ dense (with
many non-zero values), and hard to maintain in memory, and would also
make singular value decomposition inefficient.

We therefore do not center the data, to keep it sparse, and use a
matrix $L$ which is not strictly a Laplacian, but that behaves better
in practice.^1 Given the graph mentioned in §4 which is extracted from
an external source of information, we use $L$ such that
$L_{ij} = \alpha$ if $i$ and $j$ are adjacent in the graph, for an
$\alpha \in (0, 1)$ which is treated as a smoothing factor for the
graph (see below for the choices of $\alpha$), $L_{ij} = 0$ if
$i \neq j$ are not adjacent, and finally $L_{ii} = 1$ for all
$i \in [n]$. Therefore, this matrix is symmetric, and the only
constraint it does not satisfy is that of rows and columns summing to
0.

Scanning the documents and calculating the statistic matrix with the
Laplacian is computationally infeasible with a large number of tokens
given as input, since it is quadratic in that number. As such, we make
another modification to the algorithm, and calculate a “local”
Laplacian. The modification requires an integer $N$ as input (we use
$N = 12$), and then it makes updates to pairs of word tokens only if
they are within an $N$-sized window of each other. The final algorithm
we use is described in Figure 3. The algorithm works by directly
computing the co-occurrence matrix $M$ (instead of maintaining $W$ and
$C$). It does so by increasing by 1 any cells corresponding to
word-context co-occurrence in the documents and by $\alpha$ any cells
corresponding to words and contexts that are connected in the graph.
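The cross-covariance estimation step of Figure 3 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the modified SWELL code: the examples are assumed to be given in corpus order as (pivot, context) pairs, the graph is a dictionary from a word to its set of neighbours, $M$ is stored sparsely in a dictionary keyed by (word, (context position, context word)), and the function name is hypothetical. The computation of $D_1$, $D_2$ and the SVD is left out.

```python
from collections import defaultdict


def cross_covariance_counts(examples, graph, alpha, N):
    """Accumulate the sparse matrix M of Figure 3.

    examples: list of (pivot, context) pairs in corpus order, where
              context is a list of 2k words.
    graph:    dict mapping a word to the set of words it is connected to.
    alpha:    weight added for graph-connected pivot pairs.
    N:        only pairs of examples with |i - j| <= N are considered.
    """
    M = defaultdict(float)
    n = len(examples)
    for i, (pivot_i, _) in enumerate(examples):
        for j in range(max(0, i - N), min(n, i + N + 1)):
            pivot_j, context_j = examples[j]
            if i == j:
                weight = 1.0       # ordinary word/context co-occurrence
            elif pivot_j in graph.get(pivot_i, ()):
                weight = alpha     # prior knowledge: pivots linked in G
            else:
                continue
            for position, word in enumerate(context_j):
                M[(pivot_i, (position, word))] += weight
    return M


# Toy input with k = 2: two examples whose pivots are linked in the graph.
examples = [
    ("car", ["a", "red", "drives", "fast"]),
    ("automobile", ["the", "blue", "stops", "here"]),
]
graph = {"car": {"automobile"}, "automobile": {"car"}}
M = cross_covariance_counts(examples, graph, alpha=0.5, N=12)
M_local = cross_covariance_counts(examples, graph, alpha=0.5, N=0)
```

With `N=0` only the diagonal case $i = j$ is visited, so `M_local` holds the plain co-occurrence counts without any prior knowledge.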

5 Experiments 

In this section we describe our experiments.

5.1 Experimental Setup

Training Data We used three datasets, WIKI1, WIKI2 and WIKI5, all
based on the first 1, 2 and 5 billion words from Wikipedia
respectively.^2 Each dataset is broken into chunks of length 13
(window sizes of 6), corresponding to a document. The above Laplacian
$L$ is calculated within each document separately. This means that
$-L_{ij}$ is 1 only if $i$ and $j$ denote two words that appear in the
same document. This is done to make the calculations computationally
feasible. We calculate word embeddings for the 200K most frequent
words.

^1 We note that other decompositions, such as PCA, also require
centering of the data, but in the case of a sparse data matrix, this
step is not performed.

Prior Knowledge Resources We consider three sources of prior
knowledge: WordNet (Miller, 1995), the Paraphrase Database of
Ganitkevitch et al. (2013), abbreviated as PPDB,^3 and FrameNet (Baker
et al., 1998). Since FrameNet and WordNet index words in their base
form, we use WordNet’s stemmer to identify the base form for the text
in our corpora whenever we calculate the Laplacian graph. For WordNet,
we have an edge in the graph if one word is a synonym, hypernym or
hyponym of the other. For PPDB, we have an edge if one word is a
paraphrase of the other, according to the database. For FrameNet, we
connect two words in the graph if they appear in the same frame.

System Implementation We modified the implementation of the SWELL Java
package^4 of Dhillon et al. (2015). Specifically, we needed to modify
the loop that iterates over words in each document to a nested loop
that iterates over pairs of words, in order to compute a sum of the
form $\sum_{i,j} X_{ri} L_{ij} Y_{js}$.^5 Dhillon et al. (2015) use
window size $k = 2$, which we retain in our experiments.^6

^2 We downloaded the data from https://dumps.wikimedia.org/, and
preprocessed it using the tool available at
http://mattmahoney.net/dc/textdata.html.

^3 We use the XL subset of the PPDB.

^4 https://github.com/paramveerdhillon/swell.

^5 Our implementation and the word embeddings that we calculated are
available at http://cohort.inf.ed.ac.uk/cohort/eigen/.

^6 We also use the square-root transformation as mentioned in Dhillon
et al. (2015) which controls the variance in the counts accumulated
from the corpus. See a justification for this transform in Stratos et
al. (2015).

5.2 Baselines

Off-the-shelf Word Embeddings We compare our word embeddings with
existing state-of-the-art word embeddings, such as Glove (Pennington
et al., 2014), Skip-Gram (Mikolov et al., 2013b), Global Context
(Huang et al., 2012) and Multilingual (Faruqui and Dyer, 2014). We
also compare our word embeddings with the Eigen word embeddings of
Dhillon et al. (2015) without any prior knowledge.

                   Word similarity          Geographic analogies     NP bracketing
                   NPK   WN    PD    FN     NPK   WN    PD    FN     NPK   WN    PD    FN
(row truncated)    …     …     …     …      …     …     …     89.9   81.3  81.7  81.2  80.7

CCAPrior (blocks D-F)
  α = 0.1          -     59.1  59.6  59.5   -     88.9  88.7  89.9   -     81.0  82.4  81.0
  α = 0.2          -     59.9  60.6  60.0   -     89.1  91.3  90.1   -     81.0  81.3  80.7
  α = 0.5          -     59.9  59.7  59.6   -     86.9  89.3  89.3   -     81.8  81.4  80.9
  α = 0.7          -     60.7  59.3  59.5   -     86.9  89.3  92.9   -     80.3  81.2  80.8
  α = 0.9          -     60.6  59.6  58.9   -     89.1  93.2  92.5   -     81.3  80.7  81.0

CCAPrior with retrofitting (blocks G-I)
  α = 0.1          -     61.9  63.6  61.5   -     76.0  71.9  89.9   -     81.4  81.7  81.2
  α = 0.2          -     62.6  64.9  61.6   -     78.0  69.3  90.1   -     81.7  81.1  80.6
  α = 0.5          -     62.7  63.7  61.4   -     74.9  67.3  92.9   -     81.9  81.4  80.0
  α = 0.7          -     63.3  63.0  61.0   -     77.4  65.6  90.3   -     81.0  80.8  80.4
  α = 0.9          -     62.0  63.3  60.4   -     77.3  66.2  92.5   -     81.0  80.7  80.4

Table 1: Results for the word similarity datasets, geographic
analogies and NP bracketing. The first upper blocks (A-C) present the
results with retrofitting. NPK stands for no prior knowledge (no
retrofitting is used), WN for WordNet, PD for PPDB and FN for
FrameNet. Glove, Skip-Gram, Global Context, Multilingual and Eigen are
the word embeddings of Pennington et al. (2014), Mikolov et al.
(2013b), Huang et al. (2012), Faruqui and Dyer (2014) and Dhillon et
al. (2015) respectively. The second middle blocks (D-F) show the
results of our eigenword embeddings encoded with prior knowledge using
our method. Each row in the block corresponds to a specific use of an
$\alpha$ value (smoothing factor), as described in Figure 3. In the
lower blocks (G-I) we take the word embeddings from the second block,
and retrofit them using the method of Faruqui et al. (2015). Best
results in each block are in bold.

Retrofitting for Prior Knowledge We compare our approach of
incorporating prior knowledge into the derivation of CCA against the
previous works where prior knowledge is introduced in the
off-the-shelf embeddings as a post-processing step (Faruqui et al.,
2015; Rothe and Schütze, 2015). In this paper, we focus on the
retrofitting approach of Faruqui et al. (2015).

Retrofitting works by optimizing an objective function which has two
terms: one that tries to keep the distance between the word vectors
close to the original distances, and the other which enforces the
vectors of words which are adjacent in the prior knowledge graph to be
close to each other in the new embedding space. We use the retrofitting
package^7 to compare our results in different settings against the
results of retrofitting of Faruqui et al. (2015).

5.3 Evaluation Benchmarks 

We evaluated the quality of our eigenword embeddings on three
different tasks: word similarity, geographic analogies and NP
bracketing.

Word Similarity For the word similarity task we experimented with 11
different widely used benchmarks. The WS-353-ALL dataset (Finkelstein
et al., 2002) consists of 353 pairs of English words with their human
similarity ratings. Later, Agirre et al. (2009) re-annotated
WS-353-ALL for similarity (WS-353-SIM) and relatedness (WS-353-REL)
with specific distinctions between them. The SimLex-999 dataset (Hill
et al., 2015) was built to measure how well models capture similarity,
rather than relatedness or association. The MEN-TR-3000 dataset (Bruni
et al., 2014) consists of 3000 word pairs sampled from words that
occur at least 700 times in a large web corpus. The datasets MTurk-287
(Radinsky et al., 2011) and MTurk-771 (Halawi et al., 2012) were
scored by Amazon Mechanical Turk workers for relatedness of English
word pairs. The YP-130 (Yang and Powers, 2005) and Verb-143 (Baker et
al., 2014) datasets were developed for verb similarity predictions.
The last two datasets, MC-30 (Miller and Charles, 1991) and RG-65
(Rubenstein and Goodenough, 1965), consist of 30 and 65 noun pairs
respectively.

^7 https://github.com/mfaruqui/retrofitting.

For each dataset, we calculate the cosine similarity between the
vectors of word pairs and measure Spearman’s rank correlation
coefficient between the scores produced by the embeddings and human
ratings. We report the average of the correlations on all 11 datasets.
Each word similarity task in the above list represents a different
aspect of word similarity, and as such, averaging the results points
to the quality of the word embeddings on several tasks. We later
analyze specific datasets.
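The scoring procedure for a single word similarity dataset can be sketched in plain Python as below. The function names, the toy embeddings and the toy ratings are hypothetical and serve only to illustrate cosine similarity followed by Spearman's rank correlation (computed here as the Pearson correlation of average ranks); this is not the evaluation script used in the experiments.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate(embeddings, pairs):
    """pairs: list of (word1, word2, human_rating) triples."""
    model = [cosine(embeddings[a], embeddings[b]) for a, b, _ in pairs]
    human = [rating for _, _, rating in pairs]
    return spearman(model, human)


# Hypothetical two-dimensional embeddings and human ratings.
toy_embeddings = {"cat": [1.0, 0.0], "dog": [0.9, 0.1],
                  "car": [0.0, 1.0], "bus": [0.3, 0.9]}
toy_pairs = [("cat", "dog", 9.0), ("car", "bus", 8.0), ("cat", "car", 1.0)]
rho = evaluate(toy_embeddings, toy_pairs)
```

The reported number in the text would then be the average of such correlations over the 11 datasets.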

Geographic Analogies Mikolov et al. (2013c) created a test set of
analogous word pairs such as $a{:}b$ $c{:}d$, raising the analogy
question of the form “$a$ is to $b$ as $c$ is to __,” where $d$ is
unknown. We report results on a subset of this dataset which focuses
on finding capitals of common countries, e.g., Greece is to Athens as
Iraq is to __. This dataset consists of 506 word pairs. For given word
pairs, $a{:}b$ $c{:}d$ where $d$ is unknown, we use the vector offset
method (Mikolov et al., 2013b), i.e., we compute a vector
$v = v_b - v_a + v_c$ where $v_a$, $v_b$ and $v_c$ are vector
representations of the words $a$, $b$ and $c$ respectively; we then
return the word $d$ with the greatest cosine similarity to $v$.
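The vector offset method can be sketched as follows. The toy embeddings are hypothetical, and excluding the three query words from the candidate set is an assumption (a common convention, not something stated in the text).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def answer_analogy(embeddings, a, b, c):
    """Return the word d whose vector is closest to v_b - v_a + v_c."""
    va, vb, vc = embeddings[a], embeddings[b], embeddings[c]
    target = [y - x + z for x, y, z in zip(va, vb, vc)]
    # Assumption: the query words themselves are not allowed as answers.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


# Hypothetical three-dimensional embeddings for illustration only.
toy_embeddings = {
    "Greece": [1.0, 0.0, 0.0],
    "Athens": [1.0, 1.0, 0.0],
    "Iraq": [0.0, 0.0, 1.0],
    "Baghdad": [0.0, 1.0, 1.0],
    "banana": [1.0, 0.0, 1.0],
}
prediction = answer_analogy(toy_embeddings, "Greece", "Athens", "Iraq")
```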

NP Bracketing Here the goal is to identify the correct bracketing of a
three-word noun phrase (Lazaridou et al., 2013). For example, the
bracketing of annual (price growth) is “right,” while the bracketing
of (entry level) machine is “left.” Similarly to Faruqui and Dyer
(2015), we concatenate the word vectors of the three words, and use
this vector for binary classification into left or right.

Since most of the datasets that we evaluate on in this paper are not
standardly separated into development and test sets, we report all
results we calculated (with respect to hyperparameter differences) and
do not select just a subset of the results.

5.4 Evaluation 

Preliminary Experiments In our first set of experiments,
we vary the dimension of the word embedding
vectors. We try m ∈ {50, 100, 200, 300}.
Our experiments showed that the results consistently
improve when the dimension increases for all the
different datasets. For example, for m = 50 and
WIKI1, we get an average of 46.4 on the word similarity
tasks, 50.1 for m = 100, 53.4 for m = 200
and 54.2 for m = 300. The more data are available,
the more likely it is that a larger dimension will improve
the quality of the word embeddings. Indeed, for WIKI5,
we get an average of 49.4, 54.9, 57.0 and 59.5 for
each of the dimensions. The improvements with respect
to the dimension are consistent across all of
our results, so we fix m at 300.

We also noticed a consistent improvement in accuracy
when using more data from Wikipedia. For
example, for m = 300, using WIKI1 gives an average
of 54.1, while using WIKI2 gives an average
of 54.9 and finally, using WIKI5 gives an average of
59.5. We fix the dataset we use to be WIKI5.

Results Table 1 describes the results from our first
set of experiments. (Note that the table is divided
into 9 distinct blocks, labeled A through I.) In general,
adding prior knowledge to eigenword embeddings
does improve the quality of word vectors for
the word similarity, geographic analogies and NP
bracketing tasks on several occasions (blocks D-F
compared to the last row in blocks A-C). For example,
our eigenword vectors encoded with prior knowledge
(CCAPrior) consistently perform better than
the eigenword vectors that do not have any prior
knowledge for the word similarity task (59.5, Eigen
in the first row under the NPK column, versus block D).
The only exceptions are for α = 0.1 with WordNet
(59.1), for α = 0.7 with PPDB (59.3) and for
α = 0.9 with FrameNet (58.9), where α denotes the
smoothing factor.

In several cases, running the retrofitting algorithm
of Faruqui et al. (2015) on top of our word embeddings
helps further, as if “adding prior knowledge
twice is better than once.” Results for these word
embeddings (CCAPrior+RF) are shown in Table 1.
Adding retrofitting to our encoding of prior knowledge
often performs better for word similarity and
NP bracketing tasks (block D versus G and block F
versus I). Interestingly, CCAPrior+RF embeddings
also often perform better than eigenword vectors
(Eigen) of Dhillon et al. (2015) when retrofitted
using the method of Faruqui et al. (2015). For
example, in the word similarity task, eigenwords
retrofitted with WordNet get an accuracy of 62.2
whereas encoding prior knowledge using both CCA
and retrofitting gets a maximum accuracy of 63.3.
We see the same pattern for PPDB, with 63.6 for
“Eigen” and 64.9 for “CCAPrior+RF”. We hypothesize
that the reason for these changes is that the
two methods for encoding prior knowledge maximize
different objective functions.

The performance with FrameNet is weaker, in
some cases leading to worse performance (e.g., with
Glove and SG vectors). We believe that FrameNet
does not perform as well as the other lexicons because
it groups words based on very abstract concepts;
often words with seemingly distantly related
meanings (e.g., push and growth) can evoke the
same frame. This also supports the findings of
Faruqui et al. (2015), who noticed that the use of
FrameNet as a prior knowledge resource for improving
the quality of word embeddings is not as helpful
as other resources such as WordNet and PPDB.

We note that CCA works especially well for the
geographic analogies dataset. The quality of eigenword
embeddings (and the other embeddings) degrades
when we encode prior knowledge using the
method of Faruqui et al. (2015). Our method improves
the quality of eigenword embeddings.

Global Picture of the Results When comparing
retrofitting to CCA with prior knowledge, there is
a noticeable difference. Retrofitting performs well
or badly, depending on the dataset, while the results
with CCA are more stable. We attribute this
to the difference between how our algorithm and
retrofitting work. Retrofitting makes direct use of
the source of prior knowledge, by adding a regularization
term that encourages words which are similar
according to the prior knowledge to be closer in the
embedding space. Our algorithm, on the other hand,
makes more indirect use of the source of prior
knowledge, by changing the co-occurrence matrix on
which we do singular value decomposition.


Specifically, we believe that our algorithm is more
robust in cases in which words for the task at hand
are unknown words with respect to the source of
prior knowledge. This is demonstrated with the geographical
analogies task: in that case, retrofitting
lowers the results in most cases. The city and country
names do not appear in the sources of prior
knowledge we used.
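
To make the contrast concrete, the direct use of prior knowledge in retrofitting can be sketched as an iterative averaging update. This is an illustrative sketch only: the update rule follows the commonly cited form of retrofitting, and the weights used here (`alpha = 1` for the original vector, `1 / degree` for each neighbour), the iteration count and the function name are assumptions rather than details taken from this paper.

```python
def retrofit(embeddings, lexicon, iterations=10, alpha=1.0):
    """Pull each word vector towards its lexicon neighbours.

    embeddings: dict mapping a word to a list of floats.
    lexicon: dict mapping a word to a list of related words.
    Assumed weights: alpha for the original vector, 1/degree per neighbour.
    """
    original = {w: list(v) for w, v in embeddings.items()}
    current = {w: list(v) for w, v in embeddings.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            if word not in current:
                continue
            known = [n for n in neighbours if n in current and n != word]
            if not known:
                continue
            beta = 1.0 / len(known)  # neighbour weights sum to 1
            updated = []
            for d in range(len(original[word])):
                pull = sum(beta * current[n][d] for n in known)
                updated.append((pull + alpha * original[word][d]) / (1.0 + alpha))
            current[word] = updated
    # Words that never appear in the lexicon are returned unchanged.
    return current
```

In this sketch a word that is absent from the lexicon, such as a city name, is never moved, while words covered by the lexicon shift towards their neighbours.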

Further Analysis We further inspected the results
on the word similarity tasks for the RG-65 and WS-353-ALL
datasets. Our goal was to find cases in
which either CCA embeddings by themselves outperform
other types of embeddings or encoding
prior knowledge into CCA the way we describe significantly
improves the results.

For the WS-353-ALL dataset, the eigenword embeddings
get a correlation of 69.6. The next best
performing word embeddings are the multilingual
word embeddings (68.0) and skip-gram (58.3). Interestingly
enough, the multilingual word embeddings
also use CCA to project words into a low-dimensional
space using a linear transformation,
suggesting that linear projections are a good fit for
the WS-353-ALL dataset. The dataset itself includes
pairs of common words with a corresponding similarity
score. The words that appear in the dataset
are actually expected to occur in similar contexts, a
property that CCA directly encodes when deriving
word embeddings.

The best performance on the RG-65 dataset is
with the Glove word embeddings (76.6). CCA embeddings
give an accuracy of 69.7 on that dataset.
However, with this dataset, we observe significant
improvement when encoding prior knowledge using
our method. For example, using WordNet with this
dataset improves the results by 4.2 points (73.9). Using
the method of Faruqui et al. (2015) (with WordNet)
on top of our CCA word embeddings improves
the results even further, by 8.7 points over the plain
CCA embeddings (78.4).

The Role of Prior Knowledge We also designed
an experiment to test whether using distributional information
is necessary for having well-performing
word embeddings, or whether it is sufficient to rely
on the prior knowledge resource. In order to test
this, we created a sparse matrix that corresponds to
the graph extracted from the external resource. We
then followed up with singular value decomposition on
that graph, and obtained embeddings of size 300. Table 2
gives the results when using these embeddings. We
see that the results are consistently lower than the
results that appear in Table 1, implying that the use
of prior knowledge comes hand in hand with the
use of distributional information. When using the
retrofitting method by Faruqui et al. (2015) on top of these
word embeddings, the results barely improved.

Resource    WordSim    NP Bracketing
WordNet     35.9       73.6
PPDB        37.5       77.9
FrameNet    19.9       74.5

Table 2: Results on word similarity dataset (average
over 11 datasets) and NP bracketing. The word embeddings
are derived by using SVD on the similarity graph
extracted from the prior knowledge source (WordNet,
PPDB and FrameNet).

6 Related Work 

Our ideas in this paper for encoding prior knowledge
in eigenword embeddings relate to three main
threads in existing literature.

One of the threads focuses on modifying the objective
of word vector training algorithms. Yu and
Dredze (2014), Xu et al. (2014), Fried and Duh
(2015) and Bian et al. (2014) augment the training
objective in neural language models of Mikolov et
al. (2013a) to encourage semantically related word
vectors to come closer to each other. Wang et al.
(2014) propose a method for jointly embedding entities
(from FreeBase, a large community-curated
knowledge base) and words (from Wikipedia) into
the same continuous vector space. Chen and de
Melo (2015) propose a similar joint model to improve
the word embeddings, but rather than using
structured knowledge sources their model focuses
on discovering stronger semantic connections
in specific contexts in a text corpus.

Another research thread relies on post-processing
steps to encode prior knowledge from semantic lexicons
in off-the-shelf word embeddings. The main
intuition behind this trend is to update word vectors
by running belief propagation on a graph extracted
from the relation information in semantic
lexicons. The retrofitting approach of Faruqui et
al. (2015) uses such techniques to obtain higher
quality semantic vectors using WordNet, FrameNet,
and the Paraphrase Database. They report on how
retrofitting helps improve the performance of various
off-the-shelf word vectors such as Glove, Skip-Gram,
Global Context, and Multilingual, on various
word similarity tasks. Rothe and Schütze (2015)
also describe how standard word vectors can be extended
to various data types in semantic lexicons,
e.g., synsets and lexemes in WordNet.

Most of the standard word vector training algorithms
use co-occurrence within window-based contexts
to measure relatedness among words. Several
studies question the limitations of defining relatedness
in this way and investigate if the word
co-occurrence matrix can be constructed to encode
prior knowledge directly to improve the quality of
word vectors. Wang et al. (2015) investigate the notion
of relatedness in embedding models by incorporating
syntactic and lexicographic knowledge. In
spectral learning, Yih et al. (2012) augment the word
co-occurrence matrix on which LSA operates with
relational information such that synonyms will tend
to have positive cosine similarity, and antonyms will
tend to have negative similarities. Their vector space
representation successfully projects synonyms and
antonyms on opposite sides in the projected space.
Chang et al. (2013) further generalize this approach
to encode multiple relations (and not just opposing
relations, such as synonyms and antonyms) using
multi-relational LSA.

In spectral learning, most of the studies on incorporating
prior knowledge in word vectors focus
on LSA based word embeddings (Yih et al., 2012;
Chang et al., 2013; Turney and Littman, 2005; Turney,
2006; Turney and Pantel, 2010).

From the technical perspective, our work is also
related to that of Jagarlamudi et al. (2011), who
showed how to generalize CCA so that it uses locality
preserving projections (He and Niyogi, 2004).
They also assume the existence of a weight matrix
in a multi-view setting that describes the distances
between pairs of points in the two views.

More generally, CCA is an important component
for spectral learning algorithms in the unsupervised
setting and with latent variables (Cohen et al., 2014;
Narayan and Cohen, 2016; Stratos et al., 2016). Our
method for incorporating prior knowledge into CCA
could potentially be transferred to these algorithms.




7 Conclusion 


We described a method for incorporating prior
knowledge into CCA. Our method requires a relatively
simple change to the original canonical correlation
analysis, where extra counts are added to
the matrix on which singular value decomposition is
performed. We used our method to derive word embeddings
in the style of eigenwords, and tested them
on a set of datasets. Our results demonstrate several
advantages of encoding prior knowledge into eigenword
embeddings.

Appendix A: Proofs

Proof of Lemma 1. The proof is similar to the one that
appears in Koren and Carmel (2003) for Lemma 3.1.
The only difference is the use of two views. Note that
$[X^\top L Y]_{ij} = \sum_{k,k'} X_{ki} L_{kk'} Y_{k'j}$. As such,

...

where the first two terms disappear because of the definition
of the Laplacian. The comparison of Eq. 6 to Eq. 7
gives us the necessary result. □

Proof of Lemma 2. Without loss of generality, assume
$d \le d'$. Let $u'_1, \ldots, u'_d$ be the left singular vectors of
$A$ and $v'_1, \ldots, v'_d$ be the right ones, and $\sigma_1, \ldots, \sigma_d$ be
the singular values. Therefore $A = \sum_{j=1}^{d} \sigma_j u'_j (v'_j)^\top$. In
addition, the objective equals (after substituting $A$):

...

Therefore

...

In addition, note that if we choose $u_i = u'_i$ and $v_i = v'_i$,
then the inequality above becomes an equality, and
in addition, the objective in Eq. 5 will equal the sum of
the $m$ largest singular values. As such, this
assignment to $u_i$ and $v_i$ maximizes the objective. □

Proof of Lemma 3. First, by definition of matrix multiplication,

...

Also,

... $= n[X^\top Y]_{ij}$,

where $\delta_{kk'} = 1$ iff $k = k'$ and 0 otherwise, and the second
equality relies on the assumption of the data being
centered. □